Learn CUDA programming for NVIDIA Hopper GPUs. You will learn to build efficient WGMMA pipelines and use CUTLASS optimizations to perform the massive matrix multiplications that power modern AI. Beyond single-chip performance, the curriculum covers multi-GPU scaling and the NCCL primitives needed to train trillion-parameter models. To get the most out of these lessons, you should have a foundational grasp of C++ syntax and linear algebra, particularly how matrices are tiled and multiplied.
- Course website:
- Course repo:
- X:
- GitHub Sponsors:
✏️ Developed by Prateek_Shukla
❤️ Support for this channel comes from our friends at Scrimba – the coding platform that's reinvented interactive learning:
0:00:00 Course Introduction
0:07:27 Table of Contents & Course Overview
0:23:30 LESSON 1 — H100 Hopper GPU Architecture
0:25:47 H100 Specifications: HBM3, Bandwidth & Power
0:26:22 Tensor Cores Overview
0:27:18 Tensor Memory Accelerator (TMA)
0:34:44 Transformer Engine
0:34:58 L2 Cache Architecture
0:35:21 GPCs, TPCs & SM Layout
0:37:00 Thread Block Clusters
0:46:22 Distributed Shared Memory
0:52:44 SM Sub-Partitions (SMSPs)
0:54:01 Warp Schedulers & Dispatch Units
1:02:37 Shared Memory & Data Movement
1:12:20 Occupancy
1:32:49 LESSON 2 — Clusters, Data Types, Inline PTX & Pointers
1:32:57 Thread Block Clusters Programming
1:42:11 Configuring Cluster Dimensions
1:48:08 Inline PTX Assembly
1:59:31 State Spaces
2:06:01 Data Types in PTX
2:07:16 Generic Pointers
2:09:59